Forecasting and Modeling Diverse Economic Time Series: Methods and Applications

Author

Stefano Grassi

Published

June 27, 2025

1 Introduction

This project applies statistical time series forecasting techniques to real-world datasets from diverse domains with a focus on financial data, macroeconomic indicators and socioeconomic trends.

The core aims are to:

  • Assess how time series models handle diverse statistical properties, such as trend, volatility and non-stationarity.
  • Evaluate model limitations in real-world forecasting contexts.
  • Simulate forecasts and quantify uncertainty using Monte Carlo methods and kernel density estimation (KDE).
  • Compare model performance across domains and time resolutions.
  • Extend the analysis with more advanced modeling techniques.

1.1 Research Questions

The main research question is the following:

  • How do statistical time series models perform across financial, macroeconomic and socioeconomic datasets, and what are their limitations when forecasting real-world data?

The sub-research questions include:

  1. Are statistical models robust to the noise and trend structures typical in financial and economic data?
  2. Do high-frequency datasets require domain-specific modeling strategies or parameter tuning?
  3. Which models generalize well across domains and under what conditions do they fail?
  4. Can simulation-based techniques (e.g., Monte Carlo forecasting) and kernel density estimation (KDE) provide meaningful uncertainty estimates?

1.2 Notebook Structure

This notebook is organized as follows:

  • Part 1: includes core methods such as stationarity testing, classical forecasting methods, AR modeling and Monte Carlo simulations with KDE.
  • Part 2: fits AR(p) models and simulates forward paths to estimate uncertainty using higher-frequency data.
  • Part 3: includes advanced exploration by applying additional methods like ARIMA and ARCH/GARCH.

Lastly, a final reflection synthesizes results, highlights limitations and suggests directions for future work.

1.3 Experimental Settings

All experiments are conducted using Python on a Mac equipped with Apple M4 hardware. The forecasting tasks leverage the StatsForecast library (Garza et al., 2022), in combination with widely used Python packages such as pandas, statsmodels and others for data manipulation and statistical analysis. The full codebase is made available to ensure transparency and reproducibility.

1.4 Forecasting Horizons

Forecasts are evaluated at a one-step-ahead horizon, making them short-term forecasts. Although longer horizons could have been considered, the focus on a single-step forecast ensures consistency in analysis and comparison. This choice is particularly important given that some of the time series are low-frequency and relatively short, which makes forecasting over longer horizons more challenging. Additionally, Monte Carlo simulations are used to generate multi-step forecast paths, typically over a 15-step horizon, to assess uncertainty and potential future scenarios.

2 Part 1

2.1 Data Collection and Visualization

2.1.1 Dataset Overview

The dataset (see Table 1) is composed of a diverse portfolio of time series, both short and long, sourced from Alpha Vantage, Federal Reserve Economic Data (FRED) and the World Bank, covering the financial, macroeconomic and socioeconomic domains. This variety allows for a comprehensive evaluation of classical time series models across different structural and statistical properties.

Table 1: Structured Metadata of Selected Time Series

| Code | Name | Domain | Data Type | Description | Source | Frequency | Start Date | End Date |
|---|---|---|---|---|---|---|---|---|
| SPY | SPDR S&P 500 ETF Trust | Finance | Price | Daily closing price (USD) of SPY, an ETF that seeks to provide investment results that correspond to the S&P 500 Index | Alpha Vantage | Daily | 1999-11-01 | 2025-06-04 |
| EWJ | iShares MSCI Japan ETF | Finance | Price | Daily closing price (USD) of EWJ, an ETF that seeks to track the performance of Japanese equities | Alpha Vantage | Daily | 1999-11-01 | 2025-06-04 |
| AAPL | Apple Inc. Stock | Finance | Price | Intraday 1-minute closing prices (USD) for Apple Inc. stock, a leading consumer electronics and software company | Alpha Vantage | Intraday (1-min) | 2025-06-13 18:19:00 | 2025-06-13 19:59:00 |
| MSFT | Microsoft Corporation Stock | Finance | Price | Intraday 1-minute closing prices (USD) for Microsoft Corporation stock, a major software and services provider | Alpha Vantage | Intraday (1-min) | 2025-06-13 18:19:00 | 2025-06-13 19:59:00 |
| US GDP | Gross Domestic Product (GDP) | Macroeconomics | Value | Real GDP in billions of chained dollars (seasonally adjusted annual rate). Reflects the market value of all goods and services produced in the US, adjusted for inflation and seasonal effects | FRED, U.S. Bureau of Economic Analysis | Quarterly | 1946-01-01 | 2025-01-01 |
| US CPI | Consumer Price Index (CPI) | Macroeconomics | Index | Consumer Price Index for All Urban Consumers: All Items (CPIAUCSL), measured as an index with base 1982–1984 = 100, seasonally adjusted. Reflects changes in the price level of a representative basket of goods and services paid by urban consumers; percent changes in the index measure inflation rates over time | FRED, U.S. Bureau of Labor Statistics | Monthly | 1947-01-01 | 2025-04-01 |
| TH Unemployment | Thailand Unemployment Rate | Labor | Rate | Unemployment rate (% of labor force), modeled ILO estimate | World Bank | Annual | 2010-01-01 | 2024-01-01 |
| TH Labour Force | Thailand Labour Force Participation | Labor | Rate | Labor force participation rate (% of population ages 15+), ILO estimate | World Bank | Annual | 2010-01-01 | 2024-01-01 |

2.1.2 Data Loading and Preprocessing

The time series are retrieved using their respective APIs and libraries, then preprocessed and formatted into consistent, time-indexed structures suitable for analysis and modeling. In the case of U.S. GDP, missing values prior to 1947 are removed, as their exclusion does not affect the analysis. Because the Alpha Vantage and FRED APIs do not allow a fixed retrieval window to be specified, the data are exported to CSV files and subsequently reloaded to ensure reproducibility.

With regard to intraday data, missing values are handled using simple imputation techniques. For Apple, the first data point in the series is missing and is imputed using backward fill. In contrast, the Microsoft series contains a missing value within the body of the data, which is imputed using forward fill. While these methods are relatively straightforward, they are effective for initial analysis. More sophisticated techniques such as the Kalman filter or ARIMA-based imputation could be implemented in future work to enhance accuracy and robustness.
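The two imputations can be reproduced with pandas' fill methods. The series below are hypothetical stand-ins for the actual intraday data, constructed only to illustrate the two gap patterns described above:

```python
import numpy as np
import pandas as pd

# Hypothetical 1-minute close series: a missing first value (AAPL-like)
# and a missing interior value (MSFT-like).
idx = pd.date_range("2025-06-13 18:19", periods=5, freq="min")
aapl = pd.Series([np.nan, 196.4, 196.5, 196.4, 196.6], index=idx)
msft = pd.Series([474.3, 474.4, np.nan, 474.5, 474.4], index=idx)

# Backward fill: the missing first point is copied from the next observation.
aapl_filled = aapl.bfill()
# Forward fill: the interior gap carries the last known price forward.
msft_filled = msft.ffill()
```

Forward fill is the natural default for prices (the last quote remains valid until a new one arrives); backward fill is only needed when the gap sits at the very start of the sample.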

2.1.3 Visualization and Summary Statistics

Prior to further preprocessing and modeling, each time series is visualized in Figure 1 to assess key characteristics such as trends, seasonality and volatility. Complementing this visual inspection, Table 2 reports fundamental summary statistics for each series: the mean, standard deviation (volatility) and coefficient of variation. These initial descriptive metrics provide a preliminary understanding of the data's scale and variability, serving as a basis for subsequent analytical decisions.

Figure 1: Line plots illustrating the selected time series datasets used in the analysis

More detailed statistical analyses, including stationarity tests and assessments of seasonal and trend components, are presented in Section 2.2. These diagnostics support the application of appropriate data transformations and scaling procedures critical for effective forecasting model development.

Table 2: Summary Statistics of the Selected Time Series

| Series | Mean | Volatility (Std Dev) | Coefficient of Variation (CV) |
|---|---|---|---|
| SPY | 216.9960 | 131.5581 | 0.6063 |
| EWJ | 27.5043 | 23.6604 | 0.8602 |
| AAPL Intraday | 196.4459 | 0.0609 | 0.0003 |
| MSFT Intraday | 474.3858 | 0.1777 | 0.0004 |
| US GDP | 7523.4688 | 7855.7088 | 1.0442 |
| US CPI | 122.5739 | 88.0061 | 0.7180 |
| TH Unemployment Rate | 0.7309 | 0.2224 | 0.3043 |
| TH Labor Force Participation | 68.7907 | 2.1758 | 0.0316 |

Figure 1 and Table 2 reveal substantial differences in scale, dispersion and underlying dynamics across the selected series:

  • SPY and EWJ, two equity ETFs observed at daily frequency on trading days, exhibit large magnitudes and relatively high volatility. Notably, EWJ’s coefficient of variation (CV = 0.86) exceeds that of SPY (CV = 0.61), indicating greater proportional variability in the Japanese equity market. SPY shows a consistent upward trend, especially following the 2008 financial crisis, reflecting the sustained recovery of U.S. markets. In contrast, EWJ displays a structural break in November 2016, coinciding with a reverse stock split announced in October and implemented in early November 2016 (MarketBeat.com, 2025).
  • AAPL and MSFT intraday prices, despite their high absolute levels, exhibit extremely low standard deviations and coefficients of variation (on the order of 0.0003–0.0004), as expected for high-frequency data sampled at 1-minute intervals. Price changes over such short horizons tend to be small and primarily driven by noise. However, visual inspection suggests that AAPL shows a mild downward trend over the sampled day, whereas MSFT displays more acute fluctuations and possible intraday seasonality. These patterns, while suggestive, should be interpreted cautiously given the limited time span, just a single trading day, which restricts the ability to generalize or robustly detect recurring intraday dynamics.
  • U.S. GDP has the highest coefficient of variation (CV = 1.04), reflecting both its scale and structural changes over time. While GDP generally trends upward, two marked contractions are evident: during the 2008 financial crisis and the 2020 COVID-19 pandemic. As a quarterly, seasonally adjusted series, GDP shows pronounced trend and cyclical components linked to business cycles.
  • U.S. CPI demonstrates moderate volatility with a CV of 0.72. The series displays a steady upward trajectory typical of price-level indices, with a notable acceleration post-2020, likely related to pandemic-driven inflationary pressures. The CPI data are generally seasonally adjusted, although some components may retain residual seasonal effects.
  • Among socioeconomic indicators, the Thai Unemployment Rate exhibits moderate relative variability (CV = 0.30), with fluctuations plausibly driven by economic shocks such as the COVID-19 pandemic. Conversely, the Labor Force Participation Rate is highly stable (CV = 0.03) and shows a gradual downward trend, consistent with demographic factors including population aging and changing labor market structures.

Collectively, these preliminary insights guide subsequent preprocessing and model specification decisions. Typically, financial and macroeconomic series, characterized by nonstationarity and volatility, may require differencing, detrending or variance-stabilizing transformations (e.g., logarithmic). In contrast, more stable socioeconomic indicators may need minimal preprocessing, though their limited variability presents distinct modeling challenges. The next section introduces formal statistical assessments, including stationarity tests and seasonal decomposition, to inform appropriate data transformations.

2.2 Seasonal Adjustment, Transformation and Stationarity

Before applying transformations and ensuring stationarity for later modeling, all series are first examined for seasonality. Macroeconomic indicators are already seasonally adjusted. Socio-economic indicators, such as labor market statistics, are available only at annual frequency, making seasonality detection and adjustment impractical. Financial intraday series, while potentially affected by microstructure noise, are too short to exhibit meaningful seasonal patterns.

The focus is therefore placed on SPY and EWJ, which are available at daily frequency and have longer histories. Seasonal strength is assessed using STL decomposition (Cleveland et al., 1990), following the method proposed by Hyndman et al. (2021) and measured using Equation 1:

\[ F_S = \max\left(0, 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_t + R_t)}\right) \tag{1}\]

where \(S_t\) is the seasonal component and \(R_t\) is the remainder component from the STL decomposition. As shown in Table 3, both daily series exhibit seasonal strength values close to zero, indicating that seasonal adjustment is unnecessary.

Table 3: Seasonal Strength of SPY and EWJ Indices

| Series | Seasonal Strength |
|---|---|
| SPY | 0.000000 |
| EWJ | 0.055631 |

Before proceeding with the preprocessing steps, the datasets are split into training and test sets, with the test sets corresponding to a forecasting horizon of one. This approach not only follows best practices in time series modeling but also helps prevent data leakage, which could otherwise lead to misleading model performance.

The preprocessing methodology involves transforming the training sets using the natural logarithm, except for those already expressed as rates (e.g., Thai unemployment rate and labor force participation), where log transformation is not appropriate. First-order differencing is then applied to remove trends and achieve stationarity, a standard approach in time series analysis (Hamilton, 1994).

Outliers are identified using the Interquartile Range (IQR) method, typically with a multiplier of 3, which effectively detects extreme observations in economic and financial data. For labor market statistics, a more conservative multiplier of 1.5 is employed to avoid smoothing out meaningful variability in the data.

Outlier values are replaced via linear interpolation, providing a straightforward and transparent adjustment method without imposing complex assumptions. Stationarity is assessed using the Augmented Dickey-Fuller (ADF) test (Dickey & Fuller, 1981), complemented by visual inspection to reduce the risk of false positives caused by structural breaks or seasonal patterns.

While alternative tests such as the KPSS exist, the ADF test is preferred here due to its widespread acceptance in economic time series analysis. Two series posed specific challenges during preprocessing. The Thai Labor Force Participation series initially appeared to require second-order differencing without becoming stationary, but was ultimately rendered stationary through first-order differencing combined with outlier replacement. The AAPL intraday series, despite passing the ADF test, exhibited trending behavior upon visual inspection and thus required differencing.

Table 4 summarizes the preprocessing steps and transformations applied.

Table 4: Summary of Stationarity and Transformation Results for Time Series

| Series | Log Applied | Differenced | Diff Order | Final Stationary? |
|---|---|---|---|---|
| SPY | True | True | 1 | True |
| EWJ | True | True | 1 | True |
| AAPL Intraday | True | True | 1 | True |
| MSFT Intraday | True | True | 1 | True |
| US GDP | True | True | 1 | True |
| US CPI | True | True | 1 | True |
| TH Unemployment Rate | False | True | 1 | True |
| TH Labor Force Participation | False | True | 1 | True |

The preprocessing approach yields time series that are broadly stationary and suitable for subsequent modeling steps, as illustrated in Figure 2. However, for some series, such as US GDP and US CPI, stationarity appears visually questionable, suggesting that the ADF test may have produced false positives. Nonetheless, these series are considered acceptable for modeling in the context of this study.

Figure 2: Line plots of the transformed training sets used for modeling and forecasting.

2.3 Classical Forecasting Methods

In this section, four classical forecasting methods are explored, each providing a one-step-ahead forecast. These models are intentionally applied to raw, untransformed data, as they are designed for simplicity and do not account for structural features like trend or volatility:

  • Historical Average: the historical average is a simple mean of all past observations.
  • Naive Method: the forecasted value is equal to the previous observed value.
  • Random Walk with Drift: similar to the Naive method, this approach includes a stochastic component and a drift parameter to account for an average increase or decrease over time.
  • Moving Average (Window Average): this method calculates the forecast as the mean of the most recent observations. In this specific case, an arbitrary window size of 3 is used.

The Seasonal Naive method was excluded from this analysis due to the lack of strong, consistent seasonality across all series, given their varying frequencies and domains.

The accuracy of these methods is assessed using three standard metrics in time series forecasting: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). Note that with a single one-step-ahead test observation, MAE and RMSE coincide, which is why the results tables report identical values for these two metrics.
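Since all four baselines have closed-form one-step-ahead forecasts, they can be computed directly with pandas. This sketch on a toy series is independent of the StatsForecast implementation actually used in the notebook:

```python
import pandas as pd

def one_step_baselines(train: pd.Series, window: int = 3) -> dict:
    """One-step-ahead forecasts for the four classical baselines."""
    drift = (train.iloc[-1] - train.iloc[0]) / (len(train) - 1)
    return {
        "HistoricAverage": train.mean(),               # unconditional mean
        "Naive": train.iloc[-1],                       # last observed value
        "RWD": train.iloc[-1] + drift,                 # last value plus average drift
        "WindowAverage": train.iloc[-window:].mean(),  # mean of last `window` values
    }

def one_step_metrics(y_true: float, y_hat: float) -> dict:
    # With a single test observation, MAE and RMSE coincide.
    err = abs(y_true - y_hat)
    return {"mae": err, "rmse": err, "mape": err / abs(y_true)}

train = pd.Series([100.0, 101.0, 103.0, 102.0, 104.0])  # toy series
forecasts = one_step_baselines(train)
```

The drift term is the average historical change, so RWD extrapolates the line connecting the first and last observations by one step.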

A summary of the results is presented in Table 5.

Table 5: Evaluation metrics (MAE, MAPE and RMSE) for classical forecasting models

| unique_id | metric | Naive | HistoricAverage | RWD | WindowAverage |
|---|---|---|---|---|---|
| SPY | mae | 6.750000 | 380.062997 | 6.822677 | 5.730000 |
| SPY | mape | 0.011307 | 0.636621 | 0.011428 | 0.009598 |
| SPY | rmse | 6.750000 | 380.062997 | 6.822677 | 5.730000 |
| EWJ | mae | 0.610000 | 46.102891 | 0.619249 | 0.320000 |
| EWJ | mape | 0.008288 | 0.626398 | 0.008414 | 0.004348 |
| EWJ | rmse | 0.610000 | 46.102891 | 0.619249 | 0.320000 |
| AAPL Intraday | mae | 0.020000 | 0.046325 | 0.022222 | 0.053367 |
| AAPL Intraday | mape | 0.000102 | 0.000236 | 0.000113 | 0.000272 |
| AAPL Intraday | rmse | 0.020000 | 0.046325 | 0.022222 | 0.053367 |
| MSFT Intraday | mae | 0.010000 | 0.177587 | 0.013939 | 0.045000 |
| MSFT Intraday | mape | 0.000021 | 0.000374 | 0.000029 | 0.000095 |
| MSFT Intraday | rmse | 0.010000 | 0.177587 | 0.013939 | 0.045000 |
| US GDP | mae | 252.774000 | 22525.134471 | 157.980752 | 604.807333 |
| US GDP | mape | 0.008432 | 0.751423 | 0.005270 | 0.020176 |
| US GDP | rmse | 252.774000 | 22525.134471 | 157.980752 | 604.807333 |
| US CPI | mae | 0.259000 | 198.216794 | 0.059255 | 0.676333 |
| US CPI | mape | 0.000808 | 0.618307 | 0.000185 | 0.002110 |
| US CPI | rmse | 0.259000 | 198.216794 | 0.059255 | 0.676333 |
| TH Unemployment Rate | mae | 0.040000 | 0.040571 | 0.048538 | 0.269667 |
| TH Unemployment Rate | mape | 0.057720 | 0.058545 | 0.070041 | 0.389129 |
| TH Unemployment Rate | rmse | 0.040000 | 0.040571 | 0.048538 | 0.269667 |
| TH Labor Force Participation | mae | 0.409000 | 2.466143 | 0.042923 | 0.521333 |
| TH Labor Force Participation | mape | 0.006151 | 0.037091 | 0.000646 | 0.007841 |
| TH Labor Force Participation | rmse | 0.409000 | 2.466143 | 0.042923 | 0.521333 |

The forecasting performance across evaluated methods reveals consistent patterns that correspond to the structural characteristics of each time series.

The Historic Average method consistently performs the worst, particularly on series with strong directional trends, such as SPY, EWJ, US GDP, US CPI and Thailand’s labor force participation. As shown in Figure 3, this approach yields substantial forecast errors, most notably for US GDP, where the RMSE exceeds 22,500. This underperformance is expected, as the Historic Average simply projects the unconditional mean of past values, rendering it incapable of tracking time-varying trends or persistent movements. This results in large bias for non-stationary series, such as stock prices and macroeconomic aggregates.

By contrast, the Naive and Random Walk with Drift (RWD) methods perform significantly better on trending series. For example, the RWD model achieves the lowest MAE for US GDP (157.98) and US CPI (0.059), outperforming all others. Similarly, for SPY both Naive (MAE = 6.75) and RWD (MAE = 6.82) far surpass the Historical Average (MAE = 380.06). These models benefit from their simplicity and responsiveness: Naive forecasts the most recent value, while RWD incorporates a drift term, enabling it to capture gradual directional shifts more effectively.

High-frequency data, such as AAPL and MSFT intraday series, show much smaller absolute errors across all models, reflecting their scale. Here, differences between methods narrow: for AAPL Intraday, Naive and RWD yield nearly identical MAEs (0.02 and 0.022, respectively), and all models perform acceptably in line with expectations under the Efficient Market Hypothesis, which posits that price movements in liquid financial markets are largely unpredictable.

For more stable or mean-reverting series, such as the Thai Unemployment Rate, all models perform comparably well in absolute terms. The Naive method slightly outperforms the others (MAE = 0.040), while the Window Average performs worse (MAE = 0.269), suggesting over-responsiveness to short-term fluctuations in otherwise stable series.

The Window Average, which relies on a narrow window of three past values, shows mixed results. While it outperforms the Historic Average in trending series (e.g., SPY: MAE = 5.73 vs. 380.06), it remains less effective than Naive and RWD. Its high responsiveness allows it to adapt faster than a global mean, but its short memory fails to capture longer-term dynamics, leading to instability and higher errors in both trending and stable environments.

Overall, model performance is highly dependent on the data’s underlying structure. Simpler models like Naive and RWD consistently outperform in dynamic or trending series, while more rigid or overly simplistic models like the Historic Average are unable to accommodate such behaviors.

Figure 3: Forecast accuracy comparison across models and selected time series.

In conclusion, these results underscore the importance of accounting for trend when forecasting time series data. Simple methods like the Historic Average and a very short Window Average can serve as useful benchmarks and perform adequately on truly stationary or extremely stable series, but they are fundamentally limited when forecasting series with complex patterns, particularly strong trends. For such series, models that inherently incorporate trend-following mechanisms, such as the Naive and Random Walk with Drift, prove far more effective and often serve as stronger benchmarks themselves. Evaluating these foundational methods helps set performance expectations and highlights when more advanced techniques are needed to capture intricate data dynamics.

2.4 PACF and AR Modeling

This section delves into PACF analysis and AR(p) modeling. A key step before fitting autoregressive (AR) models is to examine the partial autocorrelation function (PACF) of the stationary time series to determine the appropriate lag order \(p\). The PACF is computed using the Yule-Walker method (YWM) (Walker, 1931; Yule, 1927), which provides stable estimates for regularly spaced time series. As shown in Figure 4, six lags are used for comparison, chosen based on the typical length of the high-frequency series and to maintain consistency across assets.

Figure 4: Partial Autocorrelation Function (PACF) of the training set for the selected time series

Given the PACF plots in Figure 4, determining the AR order is more challenging than in idealized theoretical cases where the correlogram shows a clear cutoff, directly indicating the lag order \(p\) of the AR model. In real-world time series, the PACF often decays gradually or exhibits ambiguous patterns, making visual identification difficult. Consequently, it is generally more reliable to fit AR models on a validation or test set and select the optimal lag order based on information criteria such as AIC, which penalizes model complexity, or forecast accuracy metrics like RMSE.

For example, the PACF of Thailand’s Unemployment Rate and Labor Force Participation suggests a lack of significant autocorrelations across lags, consistent with white noise or a weak autoregressive structure. In contrast, US GDP displays potential significant lags at 1, 2 and 6, while US CPI shows notable partial autocorrelations at lags 1 through 6. Stock indices such as SPY and EWJ resemble white noise patterns, with a possible lag at 1. For intraday data, AAPL indicates potential lags at 1 and 2, whereas MSFT’s PACF suggests lag 1 (borderline significant) and lag 5 as candidates.

Based on these observations, a simple AR(1) model was fitted to a subset of the eight time series and compared against the previously evaluated models. The choice of AR(1) was deliberate: as the most basic autoregressive model, it serves as a useful benchmark to assess improvements over classical approaches while avoiding unnecessary complexity. A summary of the AR(1) model evaluation results is presented in Table 6.

Table 6: Evaluation metrics (MAE, MAPE and RMSE) for classical forecasting models and AR(1)

| unique_id | metric | Naive | HistoricAverage | RWD | WindowAverage | AR(1) |
|---|---|---|---|---|---|---|
| AAPL Intraday | mae | 0.020000 | 0.046325 | 0.022222 | 0.053367 | 0.031102 |
| AAPL Intraday | mape | 0.000102 | 0.000236 | 0.000113 | 0.000272 | 0.000158 |
| AAPL Intraday | rmse | 0.020000 | 0.046325 | 0.022222 | 0.053367 | 0.031102 |
| EWJ | mae | 0.610000 | 46.102891 | 0.619249 | 0.320000 | 0.599073 |
| EWJ | mape | 0.008288 | 0.626398 | 0.008414 | 0.004348 | 0.008140 |
| EWJ | rmse | 0.610000 | 46.102891 | 0.619249 | 0.320000 | 0.599073 |
| MSFT Intraday | mae | 0.010000 | 0.177587 | 0.013939 | 0.045000 | 0.035309 |
| MSFT Intraday | mape | 0.000021 | 0.000374 | 0.000029 | 0.000095 | 0.000074 |
| MSFT Intraday | rmse | 0.010000 | 0.177587 | 0.013939 | 0.045000 | 0.035309 |
| SPY | mae | 6.750000 | 380.062997 | 6.822677 | 5.730000 | 6.996739 |
| SPY | mape | 0.011307 | 0.636621 | 0.011428 | 0.009598 | 0.011720 |
| SPY | rmse | 6.750000 | 380.062997 | 6.822677 | 5.730000 | 6.996739 |
| US CPI | mae | 0.259000 | 198.216794 | 0.059255 | 0.676333 | 0.524853 |
| US CPI | mape | 0.000808 | 0.618307 | 0.000185 | 0.002110 | 0.001637 |
| US CPI | rmse | 0.259000 | 198.216794 | 0.059255 | 0.676333 | 0.524853 |
| US GDP | mae | 252.774000 | 22525.134471 | 157.980752 | 604.807333 | 151.578270 |
| US GDP | mape | 0.008432 | 0.751423 | 0.005270 | 0.020176 | 0.005057 |
| US GDP | rmse | 252.774000 | 22525.134471 | 157.980752 | 604.807333 | 151.578270 |

As Table 6 suggests, across the majority of series and error metrics (MAE, MAPE, RMSE), the AR(1) model demonstrates competitive performance but does not consistently outperform the simpler benchmarks.

For high-frequency intraday series like AAPL and MSFT, the Naive and RWD methods achieve the lowest errors overall. The AR(1) model performs reasonably well, surpassing the Window Average method but falling short of the aforementioned methods, which better capture the short-term persistence inherent in these series.

In the case of stock indices such as EWJ and SPY, the AR(1) model’s performance is comparable to the Naive and RWD approaches but generally does not provide clear improvements. For example, in SPY, the Window Average method yields the lowest MAE and RMSE, suggesting that simple smoothing over recent observations might be more effective than a low-order AR model for this noisy financial data.

For macroeconomic series such as US GDP and US CPI, the AR(1) model offers notable improvements over the Naive and Historical Average methods. Specifically, for US GDP, the AR(1) model achieves the lowest MAE and RMSE (both 151.58, identical at the one-step horizon), outperforming even the RWD method. This indicates that the AR(1) structure can capture some of the persistence in these slower-moving series more effectively than simpler models. However, for US CPI, the AR(1) model performs worse than the RWD and Naive methods, suggesting that additional complexity or higher-order models might be necessary.

To further assess model adequacy, residual diagnostics were conducted using the Ljung-Box test (Ljung & Box, 1978) at 3 lags, as reported in Table 7. For most series, including SPY, EWJ and MSFT, the high p-values (p > 0.1) indicate no significant autocorrelation in the residuals, suggesting that the AR(1) model adequately captures the dynamics. However, for AAPL and US CPI the test returns p-values below conventional significance thresholds (p < 0.05), and US GDP shows a borderline rejection at the second lag, implying that residuals still contain autocorrelation. This points to possible model misspecification, where a higher-order AR process or alternative modeling strategy may be more appropriate to fully capture the underlying structure.

Table 7: Ljung-Box test results for AR(1) residuals at lags 1–3

| Series | Lag | lb_stat | lb_pvalue |
|---|---|---|---|
| SPY | 1 | 0.000234 | 0.987802 |
| SPY | 2 | 0.160992 | 0.922659 |
| SPY | 3 | 1.437902 | 0.696675 |
| EWJ | 1 | 0.000822 | 0.977121 |
| EWJ | 2 | 1.187012 | 0.552387 |
| EWJ | 3 | 1.461101 | 0.691277 |
| US GDP | 1 | 1.136620 | 0.286368 |
| US GDP | 2 | 7.608279 | 0.022278 |
| US GDP | 3 | 7.699262 | 0.052654 |
| US CPI | 1 | 3.955548 | 0.046717 |
| US CPI | 2 | 4.377666 | 0.112047 |
| US CPI | 3 | 9.993797 | 0.018619 |
| AAPL Intraday | 1 | 0.771463 | 0.379765 |
| AAPL Intraday | 2 | 10.109284 | 0.006380 |
| AAPL Intraday | 3 | 10.853067 | 0.012548 |
| MSFT Intraday | 1 | 0.098727 | 0.753363 |
| MSFT Intraday | 2 | 2.301704 | 0.316367 |
| MSFT Intraday | 3 | 2.608653 | 0.455975 |

In summary, while the AR(1) model provides a useful baseline with modest improvements in some cases, especially for macroeconomic data, it does not universally outperform classical benchmarks. These results underscore the importance of model selection tailored to the specific characteristics of each time series, with simpler models sometimes providing robust and reliable forecasts, particularly for high-frequency or more volatile data.

2.5 Monte Carlo Simulation

This section utilizes a Monte Carlo simulation based on the geometric Brownian motion (GBM) model (Black & Scholes, 1973), a widely adopted framework for modeling stock prices, to project future price paths for SPY and EWJ. These assets are chosen due to their rich historical datasets, which support more reliable simulations.

I generate 1,000 simulated price trajectories spanning a 30-day horizon, based on historical log returns. The simulation assumes that log returns are normally distributed with parameters estimated from empirical mean and standard deviation of the historical series. The simulated log returns are cumulatively summed and exponentiated to produce price paths, scaled from the last observed price.

These simulated paths are shown in Figure 5 as fan charts depicting the 50% and 90% prediction intervals. The median path is highlighted as a robust measure of central tendency, preferred over the mean due to the potential skewness of simulated prices.

Furthermore, Figure 6 displays the KDE of terminal prices, conveying the probability distribution of outcomes at the conclusion of the simulation period.

This probabilistic framework offers valuable insights into potential future price scenarios, laying the groundwork for subsequent analysis.

Figure 5: Fan chart of 1,000 Monte Carlo simulated price paths for the next 30 days, showing the median forecast along with 50% and 90% prediction intervals.

The fan chart of the Monte Carlo simulations shows that projected future prices for EWJ exhibit less dispersion and narrower prediction intervals compared to SPY, indicating relatively higher stability and predictability for EWJ over the next 30 days.

Notably, the EWJ price series operates on a lower absolute price scale than SPY, which makes direct visual comparison more challenging. However, when adjusting for scale, the prediction intervals for EWJ are comparatively tighter relative to its price level. This suggests that, while EWJ prices fluctuate at a lower absolute level, their relative uncertainty in percentage terms may be smaller or similar, which adds nuance to the interpretation of volatility between the two assets. Furthermore, the steady widening of the intervals reflects a key property of GBM: the variance of log prices grows linearly with the horizon, so the fan widens roughly with the square root of time.

The KDEs display the right-skewed shape typical of lognormal price distributions under geometric Brownian motion and typical for financial asset prices. For SPY, the median terminal price is 601.48, with a 90% coverage interval ranging from 534.94 to 668.02. In the case of EWJ, the median is 74.23 and the 90% interval spans from 59.43 to 89.03. In both cases, current prices lie close to the respective medians.

The probability of the simulated future prices falling below the current observed price is approximately 50% for both assets. With the small drift estimated over a 30-day horizon, the median of the lognormal terminal distribution remains close to the current price, implying little directional bias in short-term price movements.

It is important to emphasize that these findings rely on the geometric Brownian motion framework, which assumes constant drift and volatility and does not capture features like volatility clustering, jumps, structural breaks or mean reversion that may exist in real market data.

Figure 6: Kernel density estimate (KDE) of the simulated terminal prices after 30 days, illustrating the probability distribution of possible outcomes from the Monte Carlo simulation.

In summary, Monte Carlo simulation provides a valuable framework for understanding the distribution of potential future price paths under the assumption of GBM. It enables intuitive visualization of uncertainty, through fan charts and density plots, and supports key functions in risk management and expectation setting.

Nevertheless, GBM is grounded in several simplifying assumptions, as previously noted. While useful for illustrating baseline dynamics, these assumptions may oversimplify real-world behavior, as hinted in Figure 1 and evident in the symmetric spread of outcomes shown in Figure 5.

To address these limitations and incorporate time-dependence and memory effects observed in high-frequency data, Section 3 extends the simulation framework by replacing the random walk assumption with a more flexible AR(p) process. This approach allows for serial correlation in returns while preserving the Monte Carlo framework, making it more suitable for short-term forecasting on high-frequency datasets such as currency pairs like CHF/USD and AUD/USD.

Finally, in Section 4, I benchmark the performance of ARIMA, applied automatically and systematically to the chosen set of indicators. In addition, GARCH/ARCH models are introduced to specifically capture volatility dynamics in the high-frequency data presented in Section 3, acknowledging the importance of time-varying risk and conditional heteroskedasticity in financial returns.

3 Part 2

In this section, two high-frequency datasets, AUD/USD and CHF/USD exchange rates from January 1, 2013 to December 31, 2013, are analyzed. Autoregressive AR(p) models are fitted to the data and 25-path Monte Carlo simulations are generated to explore potential future dynamics.

3.0.1 Datasets and Visualisation

The datasets are sourced from CSV files downloaded from the University of London’s VLE. Although three years of data are available, only the most recent year was used due to local computational constraints. The data were resampled to 5-minute and 10-minute intervals for comparative analysis. As shown in Figure 7, there is no visually significant difference between the two sampling frequencies for either series. Both exhibit strong trends, potential seasonality and fluctuations that require further processing and transformations to achieve stationarity.

Figure 7: Line plot of high-frequency exchange rate series for both AUD/USD and CHF/USD.

3.1 Data Preprocessing

The data are first split into training and test sets. To satisfy the assumptions of AR(p) models, the series are transformed to achieve stationarity by applying the natural logarithm followed by first-order differencing. The ADF test confirms stationarity of the transformed series, consistent with the approach used previously in Section 2.2. A summary of the preprocessing steps is provided in Table 8.

Table 8: Summary of Stationarity and Transformation Results for High Frequency Time Series
Series Log Applied Differenced Diff Order Final Stationary?
AUD/USD 5-min True True 1 True
CHF/USD 5-min True True 1 True
AUD/USD 10-min True True 1 True
CHF/USD 10-min True True 1 True

Stationarity is also visually supported by the transformed series shown in Figure 8.

Figure 8: Line plots of the transformed training sets for AUD/USD and CHF/USD.

3.2 PACF, ACF and AR Modeling

Figure 9 and Figure 10 display the partial autocorrelation and autocorrelation functions for the first 10 lags. In addition to the previously discussed PACF, this section includes the autocorrelation function (ACF), which typically exhibits a gradual decay in autoregressive processes.

Figure 9: Partial Autocorrelation Function (PACF) plots for AUD/USD and CHF/USD with 10 lags.

Similar to the analysis presented in Section 2.4, the PACFs of all series across both sampling frequencies do not exhibit distinct cut-offs, as would be expected in ideal AR processes. For AUD/USD at the 5-minute frequency, the PACF suggests some degree of influence at multiple lags between 1 and 10, excluding lag 8. At the 10-minute frequency, the PACF shows near-threshold signals at lags 1, 2, 6, 7, 9 and 10. In contrast, the CHF/USD PACF at 5 minutes presents a pronounced spike at lag 1, with additional signals from lags 2 to 9, while the 10-minute series shows a spike at lag 1 and faint indications at lags 6, 7, 8 and 9.

These results reinforce the idea that visually determining the optimal lag length is challenging, especially in noisy high-frequency financial data. This effect is particularly notable in exchange rate series, which are known to follow near-random walk dynamics and exhibit weak, short-lived serial correlations.

Regarding the ACFs, although they more closely resemble the gradual decay expected in AR(p) processes, they still deviate from textbook examples. For instance, the AUD/USD ACF at both 5-minute and 10-minute frequencies does not exhibit a clear decay pattern but instead fluctuates near zero. In contrast, the CHF/USD ACF at 5 minutes shows a more consistent decline across lags, though not all lags approach zero, with similar behavior at the 10-minute frequency.

These observations further emphasize that changes in sampling frequency can impact the correlation structure of the series, affecting both model identification and interpretation.

Figure 10: Autocorrelation Function (ACF) plots for AUD/USD and CHF/USD with 10 lags.

3.3 Forecasting and Evaluation

Following the visual inspection of partial autocorrelation structures, four autoregressive models, AR(1), AR(2), AR(3) and AR(4), are estimated and evaluated. These models were selected due to the generally weak autocorrelations observed across the series and based on Occam’s Razor, favoring simpler lag structures (one to four lags) to reduce the risk of overfitting. Additionally, experimental results indicated no meaningful improvement beyond three lags. Forecast accuracy is assessed using the same metrics described in Section 2.3. Model selection is based on the lowest RMSE, which penalizes larger forecast errors.

As shown in Table 9, for AUD/USD at the 5-minute frequency, AR(1), AR(2) and AR(3) models perform equivalently across all metrics, each achieving an RMSE of 0.000300 and substantially outperforming the AR(4) model, which shows a high RMSE of 0.891200. At the 10-minute frequency, AR(2) slightly outperforms AR(1) and AR(3) with the lowest RMSE of 0.001419, whereas AR(4) again performs poorly.

For CHF/USD, the AR(1), AR(2) and AR(3) models yield nearly identical and very low forecast errors at both 5- and 10-minute frequencies, with RMSE values around 0.000056 and 0.000003, respectively. The AR(4) model, however, exhibits markedly worse performance, with RMSE values close to 0.89, indicating poor fit.

Table 9: Evaluation of AR(1) to AR(4) models for all high-frequency series.
unique_id metric AR(1) AR(2) AR(3) AR(4)
AUD/USD 5-min mae 0.000300 0.000300 0.000300 0.891200
AUD/USD 5-min mape 0.000336 0.000336 0.000336 1.000000
AUD/USD 5-min rmse 0.000300 0.000300 0.000300 0.891200
AUD/USD 10-min mae 0.001431 0.001419 0.001420 0.891211
AUD/USD 10-min mape 0.001606 0.001592 0.001593 1.000012
AUD/USD 10-min rmse 0.001431 0.001419 0.001420 0.891211
CHF/USD 5-min mae 0.000056 0.000056 0.000056 0.892799
CHF/USD 5-min mape 0.000063 0.000063 0.000063 0.999999
CHF/USD 5-min rmse 0.000056 0.000056 0.000056 0.892799
CHF/USD 10-min mae 0.000003 0.000003 0.000003 0.892397
CHF/USD 10-min mape 0.000003 0.000003 0.000003 0.999996
CHF/USD 10-min rmse 0.000003 0.000003 0.000003 0.892397

Based on RMSE, the best-performing models are:

  • AUD/USD 5-min: AR(1), AR(2), or AR(3) (all equivalent, with AR(1) preferred for its simpler parameterization).
  • AUD/USD 10-min: AR(2).
  • CHF/USD 5-min: AR(1).
  • CHF/USD 10-min: AR(3).

Figure 11 shows diagnostic plots for the selected AR(p) models using 35 lags. Residuals appear random over time, are roughly normally distributed and show no significant autocorrelation. These results confirm that the models fit well and are suitable for further analysis.

(a) AUD/USD 5-min.
(b) AUD/USD 10-min.
(c) CHF/USD 5-min.
(d) CHF/USD 10-min.
Figure 11: AR Residual Diagnostics.

3.4 Monte Carlo Simulation Forecasting and KDE

The final section of Part 2 presents 25 Monte Carlo simulations for all series, along with KDE plots of the simulated terminal prices. These plots provide insight into the empirical distribution of future prices and their associated risk profiles.
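A minimal numpy sketch of how such AR-based paths can be simulated and summarized into prediction intervals; the coefficient, residual standard deviation and starting price below are illustrative, not the fitted values:

```python
import numpy as np

def simulate_ar1_price_paths(p0, r_last, phi, sigma, horizon=15, n_paths=25, seed=0):
    """Simulate AR(1) log-return paths and map them back to price levels.

    p0: last observed price; r_last: last observed log return;
    phi, sigma: AR(1) coefficient and residual std (illustrative names/values).
    """
    rng = np.random.default_rng(seed)
    log_prices = np.full(n_paths, np.log(p0))
    r = np.full(n_paths, r_last)
    out = np.empty((n_paths, horizon))
    for t in range(horizon):
        r = phi * r + sigma * rng.standard_normal(n_paths)  # AR(1) step
        log_prices = log_prices + r                         # accumulate returns
        out[:, t] = np.exp(log_prices)                      # back to price level
    return out

paths = simulate_ar1_price_paths(p0=0.8909, r_last=0.0, phi=0.05, sigma=3e-4)
median = np.median(paths, axis=0)              # central forecast path
pi50 = np.percentile(paths, [25, 75], axis=0)  # 50% prediction interval
pi90 = np.percentile(paths, [5, 95], axis=0)   # 90% prediction interval
print(f"terminal median: {median[-1]:.4f}")
```

The fan charts in Figure 12 are these per-step quantiles plotted over the forecast horizon.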

Figure 12 displays 25 Monte Carlo simulation paths generated from the fitted AR(p) models for the AUD/USD and CHF/USD exchange rates, at both 5-minute and 10-minute frequencies. Each panel shows 50% and 90% prediction intervals (PIs) derived from the simulated distributions. The median is reported as a more robust measure of central tendency than the mean, particularly in the presence of skewed distributions. Simulations are initialized at recent median price levels: approximately 0.8909 for the 5-minute AUD/USD and 0.8928 for CHF/USD, with corresponding 10-minute medians at 0.8926 and 0.8924, respectively.

Across all series, the median forecast path evolves gradually, with limited directional movement, reflecting the low persistence typically observed in high-frequency FX data. The 50% PIs remain relatively narrow over the forecast horizon, while the 90% PIs expand more substantially, illustrating the compounding of uncertainty over time. Despite the change in sampling frequency, the qualitative behavior of the simulations remains consistent: the paths reflect weak autocorrelation structures and high volatility, characteristic of currency markets.

Notably, while the AR(p) models employed are stationary, the simulated medians visually resemble a drift-like evolution, which is often interpreted as weak predictability in systems with near-zero autocorrelation. The close similarity between results at the two frequencies suggests that temporal aggregation at this scale has minimal impact on forecast dynamics. Overall, the findings support the notion that even simple linear models can produce plausible short-term scenarios in FX markets, despite the underlying stochastic processes offering limited predictive content.

Figure 12: Monte Carlo simulation paths (25) generated using AR models.

Moreover, Figure 13 displays nearly identical KDE shapes across different frequencies and series. Values are approximately bounded between 0.890 and 0.892 for the AUD/USD 5-minute series, with similar ranges observed for the AUD/USD 10-minute series and CHF/USD at both frequencies, typically falling between 0.891 and 0.894. The distributions exhibit slight left skewness and minor secondary modes near the lower bound, just outside the 90% coverage interval, indicating subtle multimodality. These features reflect the uncertainty in terminal prices captured by the Monte Carlo simulations shown in Figure 12, illustrating the inherent variability and clustering of possible future FX price outcomes. The fact that current prices lie very close to the median of these distributions is consistent with both the short forecasting horizon of 15 steps and the typically mean-reverting, low-volatility nature of currency movements over short intervals.

Figure 13: Kernel density estimate of terminal simulated FX prices from Monte Carlo AR simulations.

In conclusion, while the AR(p) models provide a useful and computationally efficient framework for short-term FX price simulation, several limitations should be noted. The simplicity of the models, linear with few lags and no volatility dynamics, means they cannot capture important FX market features such as volatility clustering or sudden regime shifts. Additionally, the relatively low number of Monte Carlo paths may limit the granularity of the estimated price distributions, especially in the tails. Finally, the stationarity assumption and the short simulation horizon mean that these results should be interpreted cautiously for longer-term forecasting or risk management purposes.

4 Part 3

In this section, I fit AutoRegressive Integrated Moving Average (ARIMA) models (Box & Jenkins, 1970) to the previously used datasets, excluding the annual socio-economic series, and compare their performance with the forecasting methods discussed in Section 2. I then evaluate ARCH/GARCH models (Bollerslev, 1986; Engle, 1982), applying the best-performing specifications to enhance the simulation of 5-minute frequency data for AUD/USD and CHF/USD, as introduced in Section 3.

4.1 ARIMA Forecasting and Evaluation

ARIMA models are estimated using the automatic procedure proposed by Hyndman and Khandakar (2008), implemented via the StatsForecast library, which replicates the original R implementation. The procedure selects the optimal autoregressive (\(p\)), differencing (\(d\)) and moving average (\(q\)) orders, as well as the seasonal orders (\(P,D,Q\)), by minimizing the corrected Akaike Information Criterion (AICc) (Akaike, 1974), using conditional sums-of-squares followed by maximum likelihood estimation (MLE). Stationarity is assessed through the ADF test.

The resulting ARIMA models are used to produce one-step-ahead forecasts. Their predictive performance is reported and compared in Table 10.

Table 10: Comparison of ARIMA model performance with classical forecasting methods.
unique_id metric Naive HistoricAverage RWD WindowAverage AR(1) AutoARIMA
AAPL Intraday mae 0.020000 0.046325 0.022222 0.053367 0.031102 0.042781
AAPL Intraday mape 0.000102 0.000236 0.000113 0.000272 0.000158 0.000218
AAPL Intraday rmse 0.020000 0.046325 0.022222 0.053367 0.031102 0.042781
EWJ mae 0.610000 46.102891 0.619249 0.320000 0.599073 0.591019
EWJ mape 0.008288 0.626398 0.008414 0.004348 0.008140 0.008030
EWJ rmse 0.610000 46.102891 0.619249 0.320000 0.599073 0.591019
MSFT Intraday mae 0.010000 0.177587 0.013939 0.045000 0.035309 0.136820
MSFT Intraday mape 0.000021 0.000374 0.000029 0.000095 0.000074 0.000289
MSFT Intraday rmse 0.010000 0.177587 0.013939 0.045000 0.035309 0.136820
SPY mae 6.750000 380.062997 6.822677 5.730000 6.996739 6.749996
SPY mape 0.011307 0.636621 0.011428 0.009598 0.011720 0.011307
SPY rmse 6.750000 380.062997 6.822677 5.730000 6.996739 6.749996
US CPI mae 0.259000 198.216794 0.059255 0.676333 0.524853 0.259004
US CPI mape 0.000808 0.618307 0.000185 0.002110 0.001637 0.000808
US CPI rmse 0.259000 198.216794 0.059255 0.676333 0.524853 0.259004
US GDP mae 252.774000 22525.134471 157.980752 604.807333 151.578270 252.799710
US GDP mape 0.008432 0.751423 0.005270 0.020176 0.005057 0.008433
US GDP rmse 252.774000 22525.134471 157.980752 604.807333 151.578270 252.799710

The evaluation results show that automatically configured ARIMA models provide limited improvement over simpler benchmark methods across the datasets tested. In addition, examining the parameters in Table 11 helps explain this outcome:

  • ARIMA does not consistently outperform Naive or RWD models in important metrics like MAE and RMSE. For example, on the AAPL Intraday and MSFT Intraday datasets, ARIMA’s forecast errors are noticeably higher, indicating challenges in capturing rapid fluctuations and noise in high-frequency financial data.
  • For macroeconomic series such as US GDP and US CPI, the performance of ARIMA models is comparable to, or slightly worse than, simple benchmark models. This may reflect the inherent complexity of these series. As illustrated in Figure 2, even after standard transformations and despite passing the ADF test for stationarity, the series still appear to contain underlying trends or signals. This raises concerns about potential false positives from the stationarity tests, which can challenge both automated model selection procedures and the ARIMA framework more broadly, given its reliance on stationarity assumptions. It is surprising that for both US GDP and US CPI, which are already seasonally adjusted, the seasonal differencing parameter is still set to D=1. This suggests potential over-differencing, which can introduce unnecessary complexity and distort the underlying signal.
  • The ARIMA models selected by the automatic procedure mostly rely on simple structures, with p=1, d=0 or 1, and q=0, and seasonal orders fixed at (P,D,Q) = (0,1,0). These closely resemble simple first-order autoregressive (AR(1)) or seasonal random walk processes (Box & Jenkins, 1970), thereby limiting their capacity to capture more complex temporal dependencies or nonlinear dynamics beyond basic autoregressive structures.
  • Automatic selection criteria like AICc and ADF tests may not always identify the best model, especially in volatile or noisy datasets. This can lead to underfitting or overfitting, reducing generalization performance.
  • The poor ARIMA results on intraday financial data suggest instability in parameter estimation due to noise and rapid changes. ARIMA models lack robustness to such characteristics without further tuning or regularization.
Table 11: ARIMA model best-fit parameters
series_name p d q P D Q
SPY 1 0 0 0 1 0
EWJ 1 2 0 0 1 0
US GDP 1 0 0 0 1 0
US CPI 1 0 0 0 1 0
AAPL Intraday 1 1 0 0 1 0
MSFT Intraday 1 1 0 0 1 0

In summary, while AutoARIMA is fast and simple to implement and does not require much technical expertise, the results suggest that applying ARIMA out-of-the-box with automatic parameter selection is often inadequate for datasets exhibiting complex dynamics or high levels of noise.

As a result, meaningful improvements may require:

  • Careful preprocessing steps such as variance stabilization and manual deseasonalization to improve model interpretability and stationarity.
  • Incorporation of exogenous variables to capture structural drivers not accounted for by univariate ARIMA models.
  • For high-frequency data, combining ARIMA with volatility models like GARCH or adopting alternative models better suited to handle rapid fluctuations and noise.

4.2 Extended Analysis: Volatility Modeling and Simulation in High-Frequency Data

Building on the AR(p) modeling and simulation in Section 3, the analysis is extended by fitting ARCH and GARCH models to 5-minute AUD/USD and CHF/USD data. A hybrid AR(1)-GARCH(1,1) model is then constructed to simulate 25 future FX paths. This type of hybrid model is commonly used in real-world applications to simulate currency movements under volatility clustering.

Although StatsForecast provides ARCH/GARCH functionality, issues were encountered during simulation: specifically, the estimated \(\omega\) parameter was too large, leading to explosive future price paths (up to 5 times the original level). To address this, the arch package (Sheppard, 2023) is used instead, resulting in more stable and realistic simulations.
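A minimal numpy sketch of the AR(1)-GARCH(1,1) recursion underlying these simulations; the parameter values below are illustrative placeholders, not the estimates returned by the arch package:

```python
import numpy as np

def simulate_ar_garch(p0, params, horizon=15, n_paths=25, seed=0):
    """AR(1)-GARCH(1,1) path simulator (numpy sketch of the recursion).

    params: dict with AR constant c, AR coefficient phi, and GARCH
    parameters omega, alpha, beta (illustrative values, not fitted ones).
    """
    c, phi = params["c"], params["phi"]
    omega, alpha, beta = params["omega"], params["alpha"], params["beta"]
    rng = np.random.default_rng(seed)
    # Initialise at the unconditional variance omega / (1 - alpha - beta)
    h = np.full(n_paths, omega / (1 - alpha - beta))
    r = np.zeros(n_paths)
    log_p = np.full(n_paths, np.log(p0))
    out = np.empty((n_paths, horizon))
    for t in range(horizon):
        eps = np.sqrt(h) * rng.standard_normal(n_paths)
        r = c + phi * r + eps                  # AR(1) conditional mean
        h = omega + alpha * eps**2 + beta * h  # GARCH(1,1) conditional variance
        log_p += r
        out[:, t] = np.exp(log_p)
    return out

params = {"c": 0.0, "phi": 0.05, "omega": 1e-9, "alpha": 0.05, "beta": 0.9}
paths = simulate_ar_garch(0.8909, params)
print("terminal 90% PI:", np.percentile(paths[:, -1], [5, 95]).round(5))
```

The explosive behavior mentioned above corresponds to an overly large \(\omega\) (or \(\alpha + \beta \geq 1\)), which makes the variance recursion diverge; stationarity requires \(\alpha + \beta < 1\).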

As shown in Table 12, various ARCH and GARCH specifications with parameters \(p\) and \(q\) up to 2 are evaluated.

Table 12: ARCH/GARCH evaluation table with different (p, q) parameter combinations
model unique_id metric ARCH(1) ARCH(2) GARCH(1,1) GARCH(2,1) GARCH(2,2)
AUD/USD 5-min mae 0.891009 0.891009 0.890919 0.890909 0.890911
AUD/USD 5-min mape 0.999786 0.999786 0.999684 0.999673 0.999675
AUD/USD 5-min rmse 0.891009 0.891009 0.890919 0.890909 0.890911
CHF/USD 5-min mae 0.892642 0.892642 0.892642 0.892634 0.892608
CHF/USD 5-min mape 0.999823 0.999823 0.999823 0.999814 0.999784
CHF/USD 5-min rmse 0.892642 0.892642 0.892642 0.892634 0.892608

The evaluation of ARCH and GARCH models on AUD/USD and CHF/USD 5-minute data shows very similar performance across all metrics (MAE, MAPE and RMSE), indicating limited gains from increasing model complexity.

For AUD/USD, RMSE slightly improves from 0.891009 with ARCH(1) and ARCH(2) to 0.890909 with GARCH(2,1), while MAE remains nearly identical and MAPE decreases marginally from 0.999786 to 0.999673. Similarly, CHF/USD exhibits minimal RMSE improvement from 0.892642 (ARCH(1) and ARCH(2)) to 0.892608 (GARCH(2,2)), with equally negligible changes in MAE and MAPE.

The consistent closeness of these metrics across models suggests that more complex specifications like GARCH(2,1) or GARCH(2,2) provide only marginal accuracy gains on unseen data. Given that RMSE is the preferred metric due to its sensitivity to larger errors, these improvements are minimal and likely do not justify the added model complexity.

Therefore, choosing a simpler model such as GARCH(1,1) strikes a favorable balance, delivering comparable accuracy with better parsimony, increased stability and interpretability, especially important for simulation and forecasting purposes in high-frequency FX volatility modeling.

Before proceeding with the simulation, Figure 14 presents the raw residuals, histogram of standardized residuals, ACF of standardized residuals and ACF of squared standardized residuals for the GARCH(1,1) model.

(a) AUD/USD 5-min.
(b) CHF/USD 5-min.
Figure 14: GARCH(1,1) Residual Diagnostics.

The diagnostic plots indicate that both raw and standardized residuals exhibit properties broadly consistent with white noise. The Q-Q plots display an approximately linear shape, suggesting that the standardized residuals are reasonably normally distributed. Histograms further support this impression, showing near-Gaussian forms. While the ACF of both standardized residuals and their squares generally show low values, the latter display mild autocorrelations at certain lags that may be marginally significant. Nonetheless, these deviations are limited and the overall diagnostics support the GARCH(1,1) model as an adequate specification for both exchange rate series, with a slightly better fit observed for AUD/USD than for CHF/USD.

Finally, Figure 15 and Figure 16 present the Monte Carlo simulations and the corresponding KDE of terminal exchange rate prices for both series. As illustrated in the former, the hybrid AR(1)-GARCH(1,1) model outperforms the AR(1) model alone by producing richer dynamics and more realistic uncertainty bands, particularly at the 50% and 90% prediction intervals. Compared to the AR(1)-only simulations in Figure 12, the hybrid model captures wider and more nuanced distributions, suggesting improved modeling power, an advantage that may translate well into applications such as risk modeling and the development of trading strategies.

Figure 15: Monte Carlo simulation paths (25) generated using AR-GARCH models.

Moreover, the KDE plots in Figure 16 reveal that the terminal prices generated by the AR-GARCH simulations are closer to a normal distribution and exhibit a slight right skew. This contrasts with the heavier distortions observed in Figure 13 under the AR-only specification, further supporting the benefit of incorporating GARCH(1,1) dynamics even with a limited number of simulated paths.

Figure 16: Kernel density estimate of terminal simulated FX prices from Monte Carlo AR-GARCH simulations.

In summary, the results demonstrate that incorporating volatility dynamics through the GARCH(1,1) component markedly improves the realism and robustness of exchange rate simulations compared to simpler AR models.

Nonetheless, several limitations should be noted. The relatively small number of simulation paths may constrain the model’s capacity to capture tail risks and extreme events comprehensively. Additionally, the assumption of stationarity may reduce effectiveness in the presence of structural breaks or regime shifts frequently observed in FX markets, particularly over longer forecasting horizons or with different training datasets. Future research could focus on scaling up simulations and extending the framework to include regime-switching or stochastic volatility models to better address these complexities.

5 Conclusion

This project systematically evaluated statistical time series models across a range of real-world datasets. The findings confirm that even parsimonious models such as the Naive and Random Walk with Drift can perform remarkably well in specific domains, notably in financial and economic forecasting where short-term persistence dominates.

Autoregressive models, including AR and ARIMA, demonstrated solid performance but also revealed limitations when applied to complex or noisy datasets. The integration of Monte Carlo simulations and KDE proved to be effective for visualizing the distribution of terminal price outcomes and quantifying forecast uncertainty. However, the uncertainty bounds generated by different models were inherently shaped by their underlying assumptions. For example, while Geometric Brownian Motion offered a simple framework with constant drift and volatility, it lacked responsiveness to changing volatility, unlike AR-GARCH models which provided more adaptive, state-dependent uncertainty estimates.

The focus on one-step-ahead forecasts ensured a consistent and controlled evaluation framework, though it inherently limits insight into longer-term dynamics, an area warranting further exploration. Additional limitations include sensitivity to structural breaks and the relatively low number of Monte Carlo paths, which may underrepresent tail risk and extreme event scenarios.

In conclusion, the study underscores both the strengths and constraints of classical statistical models, providing a robust foundation for future research. It also highlights the potential for more flexible and adaptive methodologies, such as machine learning, to better accommodate the nonlinear and regime-dependent behaviors often observed in real-world time series data.

6 References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654. https://doi.org/10.1086/260062
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3), 307–327. https://doi.org/10.1016/0304-4076(86)90063-1
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. Holden-Day.
Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. (1990). STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1), 3–73.
Dickey, D. A., & Fuller, W. A. (1981). Likelihood ratio statistics for autoregressive time series with a unit root. Econometrica, 49(4), 1057–1072. https://doi.org/10.2307/1912517
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 987–1007. https://doi.org/10.2307/1912773
Garza, A., Mergenthaler Canseco, M., Challú, C., & Olivares, K. G. (2022). StatsForecast: Lightning fast forecasting with statistical and econometric models. Presented at PyCon, Salt Lake City, Utah, USA. https://github.com/Nixtla/statsforecast
Hamilton, J. D. (1994). Time series analysis (reprint 2020). Princeton University Press.
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts. https://otexts.com/fpp3/
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3), 1–22. https://doi.org/10.18637/jss.v027.i03
Ljung, G. M., & Box, G. E. P. (1978). On a measure of a lack of fit in time series models. Biometrika, 65(2), 297–303. https://doi.org/10.1093/biomet/65.2.297
MarketBeat.com. (2025). iShares MSCI Japan ETF (EWJ) price, holdings, & news. https://www.marketbeat.com/stocks/NYSEARCA/EWJ/
Sheppard, K. (2023). arch: ARCH models in Python. University of Oxford. https://arch.readthedocs.io/
Walker, G. (1931). On periodicity in series of related terms. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 131(818), 518–532. https://doi.org/10.1098/rspa.1931.0065
Yule, G. U. (1927). On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 226, 267–298. https://doi.org/10.1098/rsta.1927.0007